DAMD hashtag cooccurrence graph (re)construction

Let's look at the shape of the data about DAMD and see how to computationally construct a graph, and how that compares to doing so with an interactive tool, such as Table 2 Net.

Reading the data from a file into Python

OK, we have received a nice data file. First we can take a look at it with Excel or a text editor. Assuming the file 20170718 hashtag_damd uncleaned.csv has been placed in the same directory as this notebook, we can also take a peek in Python.


In [2]:
# first we want some Python tools to make our lives easier

import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
with open("20170718 hashtag_damd uncleaned.csv") as fd:
    for row in fd.readlines()[:3]:
        print(row)


"","tweet_id","user_id","user_name","reply_to_id","created","message","geodata","place_id","place_type","place_name","place_country","language","retweet_count","hashtags","user_mentions_name","user_mentions_id","urls","media_id","media_type","media_url"

"1","885401672448589824",43302304,"Motor Mavens","NULL","Thu Jul 13 07:33:03 +0000 2017","The @oemaudioplus #86Vantage's interior just looks so #DAMD upscale! And sounds upscale too. The crisp sound and... https://t.co/bUXNNPNHbQ","NULL","NULL","NULL","NULL","NULL","en",0,"86Vantage;DAMD","OEM AUDIO PLUS","137555927","http://fb.me/6IdLxl68T","NULL","NULL","NULL"

"2","772829925279752196",94512824,"Caspar de Kiefte","NULL","Mon Sep 05 16:13:07 +0000 2016","#DAMD -> via Kunstenbond onderdeel van internationaal netwerk waaronder Directors Guild of America https://t.co/qGBOMakAQ8","NULL","NULL","NULL","NULL","NULL","nl",0,"DAMD","NULL","NULL","http://damd.nl/nieuws/damd-via-kunstenbond-verbonden-in-internationaal-netwerk/","NULL","NULL","NULL"

That looks like a comma-separated values (CSV) file. There are many other file formats for data, but this one is quite typical. In a CSV file, each line is a data item (a tweet in this case), and the columns are variables for each item. Once loaded into pandas, we call such a table a data frame.
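As a small illustration of how CSV text turns into a data frame, the sketch below parses a made-up CSV from a string instead of the real file. The column subset and the values are invented for the example. Note how the empty first header cell becomes a column named Unnamed: 0.

```python
import io

import pandas as pd

# Made-up CSV text with the same quoting style as the real file (assumption:
# only four of the real columns, and invented tweet ids).
csv_text = '''"","tweet_id","language","hashtags"
"1","101","en","86Vantage;DAMD"
"2","102","nl","DAMD"
'''

# read_csv accepts any file-like object, so a string buffer works too
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)             # (rows, columns)
print(toy.columns.tolist())  # the unnamed first column gets a default name
```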


In [4]:
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv")

What variables do we have?


In [5]:
damd.columns


Out[5]:
Index(['Unnamed: 0', 'tweet_id', 'user_id', 'user_name', 'reply_to_id',
       'created', 'message', 'geodata', 'place_id', 'place_type', 'place_name',
       'place_country', 'language', 'retweet_count', 'hashtags',
       'user_mentions_name', 'user_mentions_id', 'urls', 'media_id',
       'media_type', 'media_url'],
      dtype='object')

Let's use the tweet_id as the index. It is a unique identifier for each tweet.


In [6]:
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv", index_col="tweet_id")
damd.head(3)


Out[6]:
Unnamed: 0 user_id user_name reply_to_id created message geodata place_id place_type place_name place_country language retweet_count hashtags user_mentions_name user_mentions_id urls media_id media_type media_url
tweet_id
885401672448589824 1 43302304 Motor Mavens NaN Thu Jul 13 07:33:03 +0000 2017 The @oemaudioplus #86Vantage's interior just l... NaN NaN NaN NaN NaN en 0 86Vantage;DAMD OEM AUDIO PLUS 137555927 http://fb.me/6IdLxl68T NaN NaN NaN
772829925279752196 2 94512824 Caspar de Kiefte NaN Mon Sep 05 16:13:07 +0000 2016 #DAMD -> via Kunstenbond onderdeel van inte... NaN NaN NaN NaN NaN nl 0 DAMD NaN NaN http://damd.nl/nieuws/damd-via-kunstenbond-ver... NaN NaN NaN
828122222111764480 3 798400767975686144 Bec NaN Sun Feb 05 06:04:58 +0000 2017 @Budah96 @sarahbuya4 #Damd Olivia went and too... NaN NaN NaN NaN NaN en 0 Damd;Damd;Scandal;sogood Spider-Paco The 🌮;Sarah 165599878;53990004 NaN NaN NaN NaN

Hashtag co-occurrence graph creation

To find patterns in the data, we might look at the #hashtags and see whether anything interesting shows up in them. Co-occurrence, meaning which hashtags appear together in the same tweet, is a useful thing to look at, and it is easy to extract from Twitter data.

We might want to build a bipartite graph ("network") $g = \langle N, V \rangle$ of tweets and hashtags to analyze hashtag co-occurrence, where $N = \{{node}_1, {node}_2, \ldots, {node}_n\}$ is the set of nodes ("spheres") and $V = \{\langle source, target \rangle_1, \langle source, target \rangle_2, \ldots, \langle source, target \rangle_m\}$ is the set of edges ("lines").

A bipartite graph has two types of nodes, which are not connected within the type, only across. In our case, hashtags are connected to tweets, but tweets are not directly connected to tweets, and hashtags are not directly connected to hashtags. Makes sense, right?
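To make the definition concrete, here is a minimal sketch with a made-up toy graph (the tweet ids and hashtags are invented, not taken from the data). It checks that tweet-to-hashtag edges alone keep the graph bipartite, and that one hashtag-to-hashtag edge breaks the property.

```python
import networkx as nx

toy = nx.Graph()
# Two node types: integers stand in for tweets, strings for hashtags (made up)
toy.add_nodes_from([1, 2], Type="tweet_id")
toy.add_nodes_from(["damd", "subaru", "scandal"], Type="hashtag")
# Edges only ever connect a tweet to a hashtag
toy.add_edges_from([(1, "damd"), (1, "subaru"), (2, "damd"), (2, "scandal")])

before = nx.is_bipartite(toy)

# Connecting two hashtags directly closes the odd cycle 1 - damd - subaru - 1
broken = toy.copy()
broken.add_edge("damd", "subaru")
after = nx.is_bipartite(broken)

print(before, after)
```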

This data manipulation can be done with Table 2 Net, but we can also do it programmatically. We will use a Python library called NetworkX.

Below is a Gephi visualization of a graph made with Table 2 Net, coloured by node type (red for tweets, green for hashtags) and showing labels for the hashtag nodes with a degree of 15 or larger. We have used the ForceAtlas2 algorithm in Gephi for positioning the nodes. The central node, the hashtag damd, has been hidden, because it is connected to nearly every tweet and so carries no information.

First let's take a peek at the shape of the hashtags, how they are stored in the data we have received.


In [7]:
damd.hashtags.head()


Out[7]:
tweet_id
885401672448589824                86Vantage;DAMD
772829925279752196                          DAMD
828122222111764480      Damd;Damd;Scandal;sogood
869614229619224576                          Damd
862237577822318592    S206;DAMD;SUBARU;TOPRACING
Name: hashtags, dtype: object

We see that each value in the hashtags column is itself a semicolon-separated list, so our data is kind of three-dimensional. We need to split it up.

From reading the documentation, we know that nx.Graph.add_edge() takes the two endpoints of one edge, a source and a target. For each tweet, we generate a list of its hashtags, and then add those edges to the graph one by one. So, from the original data shape

tweet1 hashtag1;hashtag2;hashtag3
tweet2 hashtag9;hashtag4
.
.
.

We create an intermediary data shape, one row per tweet and hashtag pair, for line 5 of the function below (the inner loop over hashtags):

tweet1 hashtag1
tweet1 hashtag2
tweet1 hashtag3
tweet2 hashtag9
tweet2 hashtag4
.
.
.

This suits what the NetworkX API expects.
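The same reshaping can also be sketched directly in pandas. This is an alternative illustration, not what the notebook's function does: it assumes a pandas version that provides Series.explode (0.25 or later) and uses the made-up tweet and hashtag names from the listing above.

```python
import pandas as pd

# The "wide" shape: one row per tweet, hashtags joined with semicolons (made-up data)
wide = pd.Series(
    ["hashtag1;hashtag2;hashtag3", "hashtag9;hashtag4"],
    index=pd.Index(["tweet1", "tweet2"], name="tweet_id"),
    name="hashtags",
)

# Split each string into a list, then give every list element its own row
long = wide.str.split(";").explode()

# Each (index, value) pair is now one (tweet, hashtag) edge
edges = list(long.items())
print(edges)
```

A list of pairs like this could be passed to add_edges_from() in one call instead of adding the edges one by one.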

Conveniently, NetworkX creates the nodes automatically, so we don't have to think about them. How can it know what the nodes are, if it only looks at links?
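One way to explore that question is a tiny experiment with made-up names: every edge mentions its two endpoints, so any endpoint not yet in the graph can be added as a node along the way.

```python
import networkx as nx

demo = nx.Graph()

# No add_node() calls at all, only edges (names are made up)
demo.add_edge("tweet1", "hashtag1")
demo.add_edge("tweet1", "hashtag2")

# Adding the same edge again changes nothing in a simple nx.Graph
demo.add_edge("tweet1", "hashtag1")

node_names = sorted(demo.nodes())
print(node_names)
print(demo.number_of_edges())
```

The repeated edge is also why a tweet that lists the same hashtag twice still ends up with a single link to it.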


In [8]:
def buildHashtagCooccurrenceGraph(tweets):
    g = nx.Graph(name="Hashtag co-occurrence bipartite")
    for tweet, hashtags in tweets.hashtags.astype(str).map(lambda l: l.split(';')).items():
        g.add_node(tweet, Type="tweet_id")  # tweets are tagged; hashtag nodes get no Type
        for hashtag in hashtags:
            # lowercasing merges "DAMD" and "Damd"; a repeated edge is stored only once
            g.add_edge(tweet, hashtag.lower())
    return g

In [9]:
g = buildHashtagCooccurrenceGraph(damd)

Now, let's briefly inspect the graph g we created.


In [10]:
# Note: nx.info() was removed in NetworkX 3.0; on newer versions use print(g) instead
print(nx.info(g))


Name: Hashtag co-occurrence bipartite
Type: Graph
Number of nodes: 2760
Number of edges: 4798
Average degree:   3.4768

Save to file, for opening in Gephi.


In [9]:
nx.write_gexf(g, "hashtag-cooccurrence-bipartite-with-python.gexf")

Compare the results of graph creation with Table 2 Net and Python

Read in the graph made with Table 2 Net.


In [10]:
g_table2net = nx.read_gexf("hashtag-cooccurrence-bipartite-with-table2net.gexf")
# Note: nx.info() was removed in NetworkX 3.0; on newer versions use print(g_table2net)
print(nx.info(g_table2net))


Name: 
Type: Graph
Number of nodes: 2760
Number of edges: 4798
Average degree:   3.4768

After poking around in Gephi for half an hour setting colours and filters, positioning with ForceAtlas2 and outputting an image, here is a visualization of the graph. It should be equal to the one above, which was visualized from a graph constructed from the data with Table 2 Net.

In graph theory, "isomorphism" (ἴσος isos "equal", and μορφή morphe "form" or "shape") means that two graphs have the same shape: the nodes of one can be relabelled so that its edges match the other's exactly. Why do we want to know this? We want to check whether we successfully reproduced the process that Table 2 Net carried out.


In [11]:
# This check is fast but not conclusive: False rules isomorphism out, True only means it is possible
nx.isomorphism.fast_could_be_isomorphic(g, g_table2net)


Out[11]:
True
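To see why a True from the fast check is only a hint, consider the following sketch with two small made-up graphs. Both have eight nodes, every node has degree 2, and neither contains a triangle, so the quick comparison cannot tell them apart, while the exact test nx.is_isomorphic() can. The exact test may be much slower on large graphs, which is presumably why the quick one was used above.

```python
import networkx as nx

# One ring of 8 nodes versus two separate rings of 4 nodes each
ring = nx.cycle_graph(8)
two_squares = nx.disjoint_union(nx.cycle_graph(4), nx.cycle_graph(4))

# Quick check: compares node counts plus degree and triangle sequences
quick = nx.isomorphism.fast_could_be_isomorphic(ring, two_squares)

# Exact check: searches for an actual node-to-node mapping
exact = nx.is_isomorphic(ring, two_squares)

print(quick, exact)
```

Both checks look only at structure and ignore node labels, so differently named nodes in the two files do not matter.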

Did we "open the black box" of Table 2 Net and Gephi?